Impact of a New Attribute Extraction Algorithm on Web Page Classification
Abstract
This paper introduces a new algorithm for dimensionality reduction and its application to web page classification. A heterogeneous collection of web pages is used as the dataset. The attributes selected for classification are the textual content of the pages. Using the proposed algorithm, the high-dimensional attributes (words extracted from the pages) are projected onto a new hyperplane with a number of dimensions equal to the number of classes. Results show that the processing times of classification algorithms decrease dramatically with the proposed reduction algorithm, mostly because the number of attributes given to the classifiers falls. Accuracies of the classification algorithms also increase compared to tests run without the proposed reduction algorithm.
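The abstract does not spell out how the projection is computed, so the following is only an illustrative sketch of the general idea: reducing word attributes to one value per class. It assumes, purely for illustration, that each word is weighted by the share of its training occurrences in each class; the function names `fit_class_weights` and `project` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def fit_class_weights(docs, labels):
    """Estimate how each word's occurrences are spread over the classes.

    Assumption for this sketch: weight(word, class) is the fraction of the
    word's training occurrences that fall in that class.
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # word -> Counter(class -> occurrences)
    for text, label in zip(docs, labels):
        for word in text.lower().split():
            counts[word][label] += 1
    weights = {}
    for word, per_class in counts.items():
        total = sum(per_class.values())
        weights[word] = [per_class[c] / total for c in classes]
    return classes, weights


def project(text, classes, weights):
    """Map one document to a vector with one component per class."""
    vec = [0.0] * len(classes)
    for word in text.lower().split():
        word_weights = weights.get(word)
        if word_weights is None:
            continue  # words unseen in training contribute nothing
        for j, value in enumerate(word_weights):
            vec[j] += value
    total = sum(vec)
    # Normalise so the components sum to 1 (left as zeros if nothing matched).
    return [v / total for v in vec] if total else vec


# Toy data, invented for this example only.
train_docs = [
    "football match goal team",
    "team wins league match",
    "stock market shares fall",
    "bank shares market profit",
]
train_labels = ["sport", "sport", "finance", "finance"]

classes, weights = fit_class_weights(train_docs, train_labels)
reduced = project("market shares and team profit", classes, weights)
```

However large the vocabulary, `reduced` has as many components as there are classes, so a downstream classifier sees far fewer attributes. That is the kind of reduction the abstract credits for the shorter processing times.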
Similar resources
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Extraction Techniques for Mining Services from Web Sources
The Web has established itself as the dominant medium for doing electronic commerce. Consequently the number of service providers, both large and small, advertising their services on the web continues to proliferate. Such web presences can range from a simple reference to the service provider in a referral page containing many such references to a full-blown web site of the service provider. Cr...
A New Hybrid Method for Web Pages Ranking in Search Engines
There are many algorithms for optimizing search engine results; ranking takes place according to one or more parameters such as backward links, forward links, content, click-through rate, etc. The quality and performance of these algorithms depend on the listed parameters. The ranking is one of the most important components of the search engine that represents the degree of the vitality...
Which Who are They? People Attribute Extraction and Disambiguation in Web Search Results
People name search often returns a lot of Web pages containing the strings of personal names. Due to namesake, extracting target person attributes (such as birthday, occupation, affiliation, nationality, contact information, etc.) is expected to be helpful to differentiate documents related to different people and thus group documents related to the same person. This paper presents the methodol...
Publication date: 2009